Abstract

I examine a random network model where nodes are categorized by type and linking 
probabilities can differ across types. I show that as homophily increases (so that the 
probability to link to other nodes of the same type increases and the probability of 
linking to nodes of some other types decreases) the average distance and diameter of 
the network are unchanged, while the average clustering in the network increases. 
1 Introduction

Advances in communication and social networking via the Internet have made it much
easier for individuals to locate others with similar backgrounds and tastes. This can affect
the formation of social networks. How do such changes in the ability of individuals to 
locate other similar individuals affect social network structure? Answering this question 
requires having models of how homophily, the tendency of nodes to be linked to other nodes 
with similar characteristics, affects social network structure. Homophily is a well-studied 
and prevalent phenomenon that is observed across all sorts of applications and attributes 
including ethnicity, age, religion, gender, education level, profession, political affiliation, 
and other attributes (e.g., see Lazarsfeld and Merton (1954), Blau (1977), Blalock (1982), 
Marsden (1987, 1988), among others, or the survey by McPherson, Cook and Smith-Lovin 
(2001)). Despite the extensive empirical research on homophily, little is known
about how homophily changes a network's basic characteristics, such as the average distance
between nodes, diameter, and clustering.

This paper examines the following questions. Consider a society of nodes that are partitioned
into a number of different groups, where nodes within a group are of the same "type"
and nodes in different groups are of different types. A network formation process is examined
that can embody various forms of homophily: the probability of links between pairs of nodes 
can depend on their respective types. Holding the degree distribution constant, how does 
such a network that is formed with substantial homophily compare to a network formed when 
types are ignored? One conjecture is that as homophily increases so that the probability of 
links among nodes of similar types increases and the probability of links across less similar 
types falls, the average distance and diameter of the network will increase since the density 
of links across different types of nodes will be falling. This conjecture turns out to be false.
Even as the probability of links across types falls, the average distance and diameter are
unchanged, even in some extreme cases where a link between nodes of the same type is
arbitrarily more likely than a link between nodes of different types, provided
some non-vanishing fraction of a node's links are still formed to nodes of other types. In
contrast, homophily can have a significant impact on clustering. It is shown that substantial 
homophily can lead to nontrivial clustering, while a process with the same expected degrees 
but no homophily exhibits no clustering. 

2 A Model of Network Formation with General Forms of Homophily and Degree Sequences

A network G = (N, g) is a graph that consists of a set N = {1, . . . , n} of a finite number n
of nodes along with a list of edges g,[1] which are the pairs of nodes that are linked to each other.

Given that the network might not be connected, I follow Chung and Lu (2002) in defining
average distance in the network to be the average across pairs of path-connected nodes. In
particular, let $\ell_g(i,j)$ be the number of links in the shortest path connecting nodes i and j
if there is such a path, and let $\ell_g(i,j)$ be infinity if there is no path between i and j in g.[2]
Thus, the average distance in the network is defined as[3]

$$AD(g) = \frac{\sum_{\{i,j\} : \ell_g(i,j) \neq \infty} \ell_g(i,j)}{|\{\{i,j\} : \ell_g(i,j) \neq \infty\}|}.$$

The diameter of the network is $\mathrm{diam}(g) = \max_{\{i,j\} : \ell_g(i,j) \neq \infty} \ell_g(i,j)$.
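The two definitions above can be illustrated with a minimal sketch, which is not part of the original analysis. It assumes nodes are labeled 0, ..., n-1, averages over unordered pairs of distinct path-connected nodes, and ignores self-pairs (the text notes that the results do not depend on how self-loops are treated). The function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: number of links on a shortest path from source}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance_and_diameter(n, edges):
    """AD(g) and diam(g) over unordered pairs of distinct, path-connected nodes.

    Assumption of this sketch: nodes are labeled 0..n-1 and self-pairs are ignored.
    """
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        if i != j:  # self-loops never shorten a path between distinct nodes
            adj[i].add(j)
            adj[j].add(i)
    total, pairs, diameter = 0, 0, 0
    for i in range(n):
        for j, d in bfs_distances(adj, i).items():
            if j > i:  # count each unordered pair once
                total += d
                pairs += 1
                diameter = max(diameter, d)
    # With no connected pairs the ratio is 0/0; this sketch returns NaN.
    average = total / pairs if pairs else float("nan")
    return average, diameter


# Path 0-1-2 plus an isolated node 3: the connected pairs have distances 1, 1 and 2.
ad, diam = average_distance_and_diameter(4, [(0, 1), (1, 2)])
print(ad, diam)
```

The isolated node contributes no pairs, which is exactly the effect of restricting the average to path-connected pairs.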

For the network formation processes considered here, the largest component contains all
but at most a vanishing fraction of nodes, and so these definitions are effectively the same
whether we define them as above or just work with the largest component of g, which is
either the whole network or almost all of it.

[1] Formally, $g \subseteq 2^N$ such that each element in g has cardinality 2.

[2] Standard definitions, such as path, are omitted. See Jackson (2008) for such definitions.

[3] Self-loops are allowed here, and so under these definitions if there is a self-loop then a node is a distance
of 1 away from itself. This is irrelevant to the results and simply for convenience. It is easily seen that the
results are the same if self-loops are ignored or if self-distance is set to 0. If there are no links in the network,
the AD expression is 0/0, which can be set to take any value.

The clustering of a node i with degree of at least 2 is

$$CL_i(g) = \frac{|\{\{j,j'\} : i \neq j \neq j' \neq i;\ \{i,j\} \in g,\ \{i,j'\} \in g,\ \{j,j'\} \in g\}|}{|\{\{j,j'\} : i \neq j \neq j' \neq i;\ \{i,j\} \in g,\ \{i,j'\} \in g\}|}.$$

The average clustering is the average of $CL_i$ across nodes i that have degree at least 2.[4]
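As an illustration of this definition, here is a minimal sketch (not from the original text) that computes the clustering of a node and the average clustering from an adjacency map. The representation as a dict of neighbor sets and the function names are assumptions of the sketch; average clustering is set to 0 when no node has degree at least 2.

```python
from itertools import combinations


def clustering(adj, i):
    """Share of pairs of i's neighbors that are linked to each other.

    Returns None when i has fewer than two neighbors other than itself.
    """
    neighbors = adj[i] - {i}  # the definition requires j and j' to differ from i
    if len(neighbors) < 2:
        return None
    pairs = list(combinations(sorted(neighbors), 2))
    closed = sum(1 for j, jp in pairs if jp in adj[j])
    return closed / len(pairs)


def average_clustering(adj):
    """Average of the node clusterings over nodes with degree at least 2."""
    values = [c for c in (clustering(adj, i) for i in adj) if c is not None]
    return sum(values) / len(values) if values else 0.0


# Triangle on {0, 1, 2} with an extra node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering(adj, 0), average_clustering(adj))
```

In this example only one of the three pairs of neighbors of node 0 is linked, while nodes 1 and 2 each have their single pair of neighbors linked, and node 3 is excluded from the average.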



2.1 A General Random Network Model with Homophily 

The following model is a generalization of the random network model from Chung and Lu 
(2002) to allow nodes to be of different types and to allow heterogeneous probabilities of 
linking across different types. 

A set of nodes N = {1, . . . , n} is partitioned into K groups or types $N_1, \ldots, N_K$. This
partition captures the characteristics of the nodes, so that all nodes with the same characteristics
are in the same group $N_k$. Depending on the application, a type might embody
ethnicity, gender, age, education, profession, etc. in a social setting, or might involve characteristics
of a business in a market network, or might involve some physical characteristics
of a node in a physical network.

Also given is a degree sequence $\{d_1, \ldots, d_n\}$, which indicates the expected degree or number
of connections of each node. Let

$$D = \sum_i d_i$$

and

$$\tilde{d} = \frac{\sum_i d_i^2}{D}.$$

Note that if $d_i = d$ for all i, then $\tilde{d} = d$.

Let $D_k = \sum_{i \in N_k} d_i$ be the total degree of all nodes of type k.

A random network is formed according to the following process. For each pair of types
k and k' there is a parameter $h_{kk'} > 0$. This parameter captures the relative proclivity of
groups k and k' to link to each other. The parameters satisfy $\sum_{k'} D_{k'} h_{kk'} = D$ for each k.
A link between nodes i in group k and j in group k' is formed with probability

$$h_{kk'} d_i d_j / D.$$

Conditions defined below ensure that this expression does not exceed 1. 
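As a quick check of why the normalization on the parameters lets each $d_i$ be read as an expected degree, the following sketch sums the linking probabilities of a node i in group k over all nodes j, counting a possible self-loop once and assuming no probability is capped at 1:

```latex
\sum_{k'} \sum_{j \in N_{k'}} \frac{h_{kk'}\, d_i d_j}{D}
  = \frac{d_i}{D} \sum_{k'} h_{kk'} D_{k'}
  = \frac{d_i}{D} \cdot D
  = d_i .
```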

In the case where $h_{kk} > h_{kk'}$ for all k and $k' \neq k$, there is homophily, so that
nodes are relatively more likely to form their links to their own types than to other types. If
$h_{kk'} = 1$ for all k and k', then types are irrelevant and the model reduces to the usual Chung
and Lu model. Otherwise, this allows for different patterns of linkings between different
types. If $d_i = d$ for all i, then this is a generalization of Erdős-Rényi random graphs where
links are type-dependent.[5] More generally, the degree distribution could vary across nodes,
and power-law networks are the special case where the frequency distribution of $\{d_1, \ldots, d_n\}$
has a power distribution, where the frequency of degree d is of the form $c d^{-\gamma}$ for some range
of d.

[4] Set clustering to 0 if there are no such nodes.

An interesting case is where types have some social or spatial geography and type k can
be represented as a vector $x_k \in \mathbb{R}^m$ for some m, and then $h_{kk'}$ is decreasing in the distance
between k and k'; for example, of the form $c - f(|x_k - x_{k'}|)$ where c is a constant and f is an
increasing function. One can also consider some hierarchy among the k's with the relative
probabilities depending on the hierarchy (e.g., see Clauset, Moore and Newman (2008)).
Another case of interest is where types have a given probability of forming links to their own
type and a different probability of forming links to all other types (e.g., see Copic, Jackson and
Kirman (2005) and Currarini, Jackson and Pin (2007)).
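The formation process of this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's code: types are integers 0, ..., K-1, `h` is a K-by-K nested list, the function names are hypothetical, and each unordered pair (including a possible self-loop) is drawn independently with probability $h_{kk'} d_i d_j / D$.

```python
import random


def check_normalization(types, degrees, h, tol=1e-9):
    """Check that sum over k' of D_{k'} * h[k][k'] equals D for every type k."""
    K = len(h)
    D_k = [0.0] * K
    for t, d in zip(types, degrees):
        D_k[t] += d
    D = sum(D_k)
    return all(
        abs(sum(D_k[kp] * h[k][kp] for kp in range(K)) - D) <= tol * D
        for k in range(K)
    )


def sample_network(types, degrees, h, rng=None):
    """Draw one network; link {i, j} forms with probability h[k][k'] * d_i * d_j / D."""
    rng = rng or random.Random()
    n = len(degrees)
    D = sum(degrees)
    edges = set()
    for i in range(n):
        for j in range(i, n):  # j == i allows self-loops, as in the model
            p = h[types[i]][types[j]] * degrees[i] * degrees[j] / D
            if p > 1:
                raise ValueError("linking probability exceeds 1")
            if rng.random() < p:
                edges.add((i, j))
    return edges


# Hypothetical example: two types of 50 nodes each, all with expected degree 10,
# and own-type links three times as likely as cross-type links.
types = [0] * 50 + [1] * 50
degrees = [10.0] * 100
h = [[1.5, 0.5], [0.5, 1.5]]
edges = sample_network(types, degrees, h, random.Random(0))
print(check_normalization(types, degrees, h), len(edges))
```

Setting every entry of `h` to 1 gives the type-blind benchmark with the same expected degrees, which is the comparison made in the results below.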



2.2 Admissible Models 

The main results consider a growing sequence of network formation models, and so all parameters
are indexed by n, the number of nodes. The results use some restrictions on variation
in expected degrees across nodes and a minimum bound on the proclivity to link across 
groups. A sequence of network formation processes is said to be admissible if the following 
conditions are satisfied. 

First, there exists h > 0 such that $h_{kk'}(n) > h$ for all k and k' for all large enough n. This
condition does not require that nodes of different types have a probability of linking that is
bounded below, as a node's degree could be a fixed number independent of n. This lower
bound simply implies that any given node spreads some of its links on types other than its
own type. This still allows for extreme homophily, as it can still be that $h_{kk}(n) \to \infty$ and
that links with a node's own type become infinitely more likely than links with
some other types.
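To see how this bound can coexist with extreme homophily, consider a hypothetical illustration that is not taken from the analysis itself: there are K(n) types with equal total degree $D_k = D/K(n)$, every cross-type parameter equals a constant a with h < a < 1, and all own-type parameters are equal. The normalization $\sum_{k'} D_{k'} h_{kk'} = D$ then pins down the own-type parameter:

```latex
\frac{D}{K(n)} \Bigl( h_{kk}(n) + (K(n) - 1)\, a \Bigr) = D
\quad \Longrightarrow \quad
h_{kk}(n) = K(n) - (K(n) - 1)\, a = a + K(n)(1 - a).
```

If K(n) grows without bound, then $h_{kk}(n) \to \infty$, while the expected share of a node's links that go to other types, $a (K(n) - 1)/K(n)$, tends to a > 0.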

Second, the degree sequence satisfies the following: 

• there exists ε > 0 such that $\tilde{d}(n) > (1+\varepsilon) \log(n)$ for large enough n and $\log(\tilde{d}(n)) / \log(n) \to 0$,

• there exists c > 0 such that hc > 1, and M > 0, such that $d_i(n) < M \tilde{d}(n)$ for all i and
n, and $d_i(n) > c$ for all but o(n) nodes.[6]
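The degree conditions are asymptotic, so they cannot be verified from a single degree sequence. Still, the quantities involved can be computed for a given n as a rough diagnostic. The sketch below is only that: it takes the second-order average degree to be $\sum_i d_i^2 / \sum_i d_i$, as in Chung and Lu (2002), and the function name and the choice of reported fields are hypothetical.

```python
import math


def degree_condition_report(degrees, h_lower, eps, c, M):
    """Evaluate the quantities in the degree conditions for one finite n."""
    n = len(degrees)
    D = sum(degrees)
    d_tilde = sum(d * d for d in degrees) / D  # second-order average degree
    return {
        "d_tilde": d_tilde,
        # d_tilde should exceed (1 + eps) * log(n) for large n
        "growth_ok": d_tilde > (1 + eps) * math.log(n),
        # this ratio should tend to 0 along the sequence
        "log_ratio": math.log(d_tilde) / math.log(n),
        "hc_ok": h_lower * c > 1,
        "max_ok": max(degrees) < M * d_tilde,
        # share of nodes with expected degree at most c; should vanish
        "share_at_most_c": sum(1 for d in degrees if d <= c) / n,
    }


# Hypothetical example: 1000 nodes, all with expected degree 20.
report = degree_condition_report([20] * 1000, h_lower=0.5, eps=0.1, c=3, M=2)
print(report)
```

A single report says nothing about limits; comparing reports along a sequence of growing n is closer to what the conditions require.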

The first restriction is that the second-order average degree is growing with n, but more
slowly than n. It grows at a rate fast enough that the giant component includes a fraction
approaching 1 of the nodes. The second requires that no node have an expected degree that
explodes relative to the average expected degree and that all but a vanishing fraction of
nodes have a lower bound on expected degree that is larger than 1.

[5] Note, however, that this process allows for self-loops (i may connect to i), although the probability of this
for any node i vanishes as n grows provided $d_i^2/D$ vanishes.

[6] Here, h is as defined in the restrictions on proclivity to link across types. These conditions ensure that
the degree sequence satisfies (i) and (ii) in Chung and Lu (2000). They also guarantee (iii) setting U = N
and noting that $\tilde{d}(n) < M^2 D(n)/n$.



3 Diameter and Average Distance in the Model 

Let AD(n, d(n), h(n)) and diam(n, d(n), h(n)) be the average distance and diameter, respectively,
of a graph randomly drawn according to the process above with n nodes, degree sequence
$d(n) = (d_1(n), \ldots, d_n(n))$, and homophily parameters $h(n) = (h_{kk'}(n))_{kk'}$. The average
distance and diameter are random variables for each n. Similarly, let AD(n, d(n)) and
diam(n, d(n)) be the average distance and diameter, respectively, of a graph randomly drawn
according to the process above with n nodes, degree sequence $d(n) = (d_1(n), \ldots, d_n(n))$, and
without any homophily (so that $h_{kk'}(n) = 1$ for all k and k').

Theorem 1 Consider an admissible sequence of network formation processes (n, d(n), h(n)). 
Asymptotically almost surely: 

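• $AD(n, d(n), h(n)) = (1 + o(1)) \log(n)/\log(\tilde{d}(n))$, and so $\frac{AD(n, d(n), h(n))}{AD(n, d(n))} \to 1$,

• $\mathrm{diam}(n, d(n), h(n)) = \Theta(\log(n)/\log(\tilde{d}(n)))$, and so $\mathrm{diam}(n, d(n), h(n)) = \Theta(\mathrm{diam}(n, d(n)))$.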



Thus, the average distance and diameter of the admissible processes are not affected by 
homophily. Even though there can be an arbitrarily increased density of links within types, 
and substantial decrease in the density of links across types, this does not impact average 
distance or the diameter in the network. In order for homophily to affect these aspects of the 
network, one would have to have the density of links across most types decrease at a level 
which vanishes relative to overall degree. That is, suppose instead that nodes are grouped
into evenly sized groups (up to integer constraints) so that $h_{kk'}(n) < f(n)$ for all k and k'
with $k' \neq k$, for some f(n) such that $f(n) n \tilde{d}(n)/K(n)$ is bounded above and where K(n)/n
is bounded away from 0. Then, it is easy to check that,[7] almost surely, $\frac{AD(n, d(n), h(n))}{AD(n, d(n))} \to \infty$
and so $\frac{\mathrm{diam}(n, d(n), h(n))}{\mathrm{diam}(n, d(n))} \to \infty$.
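To give a sense of the magnitudes involved in Theorem 1, the following minimal sketch evaluates the leading-order term log(n)/log(d̃(n)) for a few hypothetical values of n and of the second-order average degree. It ignores the (1 + o(1)) factor, so the numbers are only indicative for finite n, and the function name is an assumption of the sketch.

```python
import math


def predicted_average_distance(n, d_tilde):
    """Leading-order term log(n) / log(d_tilde) from Theorem 1."""
    if n < 2 or d_tilde <= 1:
        raise ValueError("need n >= 2 and d_tilde > 1")
    return math.log(n) / math.log(d_tilde)


for n, d_tilde in [(10**4, 10), (10**6, 100), (10**6, 1000)]:
    print(n, d_tilde, round(predicted_average_distance(n, d_tilde), 3))
```

The homophily parameters do not appear among the arguments, which is the content of the theorem for admissible processes.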

[7] A lower bound on the average distance is that of a graph where all nodes of a given type are agglomerated
to become a single node. There are K(n) nodes in this graph and each of these type-nodes has degree
of at most $\tilde{d} M f(n) n/K(n)$, which is bounded above by some C. The average distance is at least of order
log(K(n))/log(C), which is proportional to log(n), provided this network has a giant component containing
all but at most a vanishing fraction of nodes. The average distance could only be smaller than this if the
connectivity across types drops so low that the network fragments into smaller components.

Proof of Theorem 1: Consider a network formation process such that each node has
expected degree $h d_i$ and $h_{kk'} = 1$ for all k, k'. This is the process (n, hd(n)), and the process
(n, h(n), d(n)) is equivalent to first running the process (n, hd(n)) and then adding some
additional links. Under the admissibility requirement here, (n, hd(n)) is admissible and
specially admissible under the definitions of Chung and Lu (2002). By Lemma 5 in Chung
and Lu (2002), almost surely the largest component of a random graph under the process
(n, hd(n)) contains all but at most o(n) of the nodes. By Theorems 1 and 2 in Chung and
Lu (2002) the average distance and diameter of this process are almost surely

$$(1 + o(1)) \log(n)/\log(h\tilde{d}(n)) = (1 + o(1)) \log(n)/\log(\tilde{d}(n))$$

and

$$\Theta(\log(n)/\log(h\tilde{d}(n))) = \Theta(\log(n)/\log(\tilde{d}(n))),$$

respectively. Since the process (n, h(n), d(n)) is equivalent to first running the process
(n, hd(n)) and then adding some additional links, it follows directly that a random graph
generated in this way has a largest component containing all but at most o(n) of the nodes,
and that its average distance and diameter are almost surely bounded above by
$(1 + o(1)) \log(n)/\log(\tilde{d}(n))$ and some factor times $\log(n)/\log(\tilde{d}(n))$, respectively.
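Two steps of this upper-bound argument can be spelled out as a sketch, writing h for the lower bound on the parameters $h_{kk'}(n)$. First, the type-blind process with expected degrees $h d_i$ has total degree hD, so pair by pair its linking probability is no larger than that of the process with homophily, which is why the latter can be generated by adding links. Second, the constant factor h drops out of the rate because $\tilde{d}(n) \to \infty$:

```latex
\frac{(h d_i)(h d_j)}{hD} = \frac{h\, d_i d_j}{D} \le \frac{h_{kk'}(n)\, d_i d_j}{D},
\qquad
\frac{\log(h \tilde{d}(n))}{\log(\tilde{d}(n))}
  = 1 + \frac{\log h}{\log \tilde{d}(n)} \longrightarrow 1 .
```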

Next, let us show that these are also lower bounds. First, consider a network where all
nodes have degree no more than $M'\tilde{d}(n)$. Consider any node i. The T-th neighborhood of i
includes no more than

$$\sum_{t=1}^{T} (M'\tilde{d}(n))^t = \frac{(M'\tilde{d}(n))^{T+1} - M'\tilde{d}(n)}{M'\tilde{d}(n) - 1}$$

nodes. Thus, in order to reach all nodes in the largest component from some node in
the largest component (which as argued above contains at least (1 − o(1))n nodes), it
takes at least $T(n) = \log((1 - o(1))n)/\log(M'\tilde{d}(n))$ steps, almost surely.
Given that $\tilde{d}(n) \to \infty$, it follows that $T(n) \geq (1 - o(1)) \log(n)/\log(\tilde{d}(n))$.
The average distance is thus almost surely at least

$$\sum_{t=1}^{T(n)} t\,(M'\tilde{d}(n))^t / n.$$

This is at least (1 − o(1))T(n), almost surely. Thus, the lower bound on average distance
is $(1 - o(1)) \log(n)/\log(\tilde{d}(n))$. The diameter is at least the average distance, and so this is
also a lower bound on diameter.
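The counting step behind this lower bound can be illustrated numerically. The sketch below is only an illustration with hypothetical names and values: it computes the geometric bound on the number of nodes within T steps of a node when no node has more than `max_degree` neighbors, and the smallest T at which that bound reaches a target number of nodes.

```python
import math


def neighborhood_bound(max_degree, T):
    """Upper bound sum_{t=1}^{T} max_degree**t on nodes within T steps of a node."""
    return sum(max_degree ** t for t in range(1, T + 1))


def min_steps_to_cover(target, max_degree):
    """Smallest T whose neighborhood bound reaches `target` nodes."""
    if max_degree < 1:
        raise ValueError("max_degree must be at least 1")
    T = 0
    while neighborhood_bound(max_degree, T) < target:
        T += 1
    return T


# Hypothetical example: a million nodes and a degree cap of 50.
T = min_steps_to_cover(target=10**6, max_degree=50)
print(T, math.log(10**6) / math.log(50))
```

The integer T returned is the ceiling-type counterpart of the ratio of logarithms printed next to it, which is the quantity that appears in the bound on T(n).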

Let us now show that with a probability going to 1 all nodes have degree of no more
than $2M\tilde{d}$, and then setting M' = 2M implies the result. This probability is at least
$\prod_i \Pr(d_i < 2M\tilde{d})$, where inside the probability $d_i$ denotes the realized degree of node i.
From Fact 1 in Chung and Lu (2002b) it follows that for any given i,
$\Pr(d_i < 2M\tilde{d}) > 1 - e^{-M\tilde{d}/3}$ (bounding $E(d_i)$ by $M\tilde{d}$ and setting ε in their fact to 1). The
overall probability is then at least $(1 - e^{-M\tilde{d}/3})^n$. Given that $\tilde{d} > (1+\varepsilon) \log(n)$, it follows
that this expression is at least $(1 - e^{-M(1+\varepsilon)\log(n)/3})^n$, which goes to 1 (taking M/3 > 1 without loss of generality
in the definition of M) since $n e^{-M(1+\varepsilon)\log(n)/3}$ goes to 0. So, with a probability
of 1 − o(1) the average distance is $(1 + o(1)) \log(n)/\log(\tilde{d}(n))$. ∎
4 Clustering

Note that in the model with no homophily, if $(\max_i d_i(n))^2/D(n) \to 0$, then the average
clustering almost surely tends to 0 simply because the most probable link has a probability
that tends to 0. In contrast, if groups are relatively small (of the order of average
degree) and there is substantial homophily, then average clustering does not vanish. Thus, 
homophilistic networks exhibit the characteristics of the "small worlds" discussed by Watts 
and Strogatz (1998): nontrivial clustering at the same time as having a diameter on the 
order of a uniformly random graph. 
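For the no-homophily benchmark mentioned at the start of this section, a simple special case makes the vanishing clustering concrete. Suppose, purely as an illustration, that all expected degrees are equal, $d_i(n) = d(n)$. Then D(n) = n d(n), every link forms with the same probability, and this is also the probability that two given neighbors of a node are linked to each other:

```latex
\frac{(\max_i d_i(n))^2}{D(n)} = \frac{d(n)^2}{n\, d(n)} = \frac{d(n)}{n} \longrightarrow 0
\quad \text{whenever } d(n) = o(n).
```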

Theorem 2 Consider a setting such that (i) there is some m > 0 such that for large enough
n, $h_{kk}(n) D_k(n)/D(n) > m$ for all k, (ii) $\max_i d_i(n)/\max_k |N_k|$ and $\min_i d_i(n)/\max_i d_i(n)$
are each Ω(1), and $\max_i d_i(n) > 2$. Asymptotically almost surely, average clustering is Ω(1).

The proof of Theorem 2 is straightforward and so only sketched here. Let
$\max_i d_i(n)/\max_k |N_k| > m_1 > 0$ and $\min_i d_i(n)/\max_i d_i(n) > m_2 > 0$ for all large enough n.
The probability of a link between any two nodes of the same type is at least

$$\frac{(m_2 \max_i d_i(n))^2 \min_k h_{kk}}{D} \geq \frac{(m_2 \max_i d_i(n))^2\, m}{\max_k D_k(n)} \geq \frac{(m_2 \max_i d_i(n))^2\, m}{\max_k |N_k(n)| \max_i d_i(n)} \geq m_2^2 m_1 m > 0$$

for all large enough n. Given that there is a bound $m_3 > 0$ so that each node has an
expectation of forming a fraction of at least $m_3$ of its links within its own group, and the
clustering among pairs of nodes that it is linked to of own type is at least $m_2^2 m_1 m > 0$, it
follows that the expected clustering of any node is bounded away from 0 (conditional on
it having degree at least 2). Given that the expected clustering of all nodes is bounded
away from 0 (conditional on having at least degree 2), and all nodes have expected degree
bounded away from 0 and so a non-vanishing fraction almost surely end up with degree of
at least 2, it can then be shown that the average clustering is almost surely bounded away from 0.
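A hypothetical numerical case shows how the chain of inequalities in the sketch works. Suppose every group has 25 nodes, every node has expected degree 20, and $h_{kk} D_k / D = 1/2$ for each k, so that $D_k = 25 \cdot 20$. The probability of a link between two nodes of the same type is then

```latex
\frac{h_{kk}\, d_i d_j}{D}
  = \frac{h_{kk} D_k}{D} \cdot \frac{d_i d_j}{D_k}
  = \frac{1}{2} \cdot \frac{20 \cdot 20}{25 \cdot 20}
  = 0.4 .
```

In this example $m_2$ can be taken arbitrarily close to 1, $m_1$ arbitrarily close to 20/25 = 0.8, and m arbitrarily close to 1/2, so the bound $m_2^2 m_1 m$ can be brought arbitrarily close to the same value of 0.4, which does not shrink as the number of groups grows.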



5 Discussion 

The results here show that substantial homophily and bias in the way that different types of 
nodes link to each other can be introduced without altering the average distance or diameter 
of a network. On one level this might not have been expected, and yet the proof of this is 
very simple and basically relies on the fact that some rescaling of the degree of a node up to 
a fixed factor does not alter the asymptotic average distance and diameter of the resulting 
networks. This does not mean that homophily leaves all properties of the network unchanged,
as we have seen with clustering. Also, as shown in Golub and Jackson (2008),
networks with substantial homophily can still behave quite differently, so that even though 
diameter and average distance remain unchanged, the speed of learning can decrease by 
orders of magnitude and mixing time on such networks can correspondingly increase by 
orders of magnitude. 